Biostat 212a Homework 6

Due Mar 22, 2024 @ 11:59PM

Author

Yang An (UID 106332601)

Published

March 19, 2024

Load R libraries.

rm(list = ls())
library(tidyverse)
library(tidymodels)
library(readr)
library(tswge)
library(ggplot2)
library(yardstick)

acfdf <- function(vec) {
    vacf <- acf(vec, plot = F)
    with(vacf, data.frame(lag, acf))
}

ggacf <- function(vec) {
    ac <- acfdf(vec)
    ggplot(data = ac, aes(x = lag, y = acf)) + geom_hline(aes(yintercept = 0)) + 
        geom_segment(mapping = aes(xend = lag, yend = 0))
}

tplot <- function(vec) {
    df <- data.frame(X = vec, t = seq_along(vec))
    ggplot(data = df, aes(x = t, y = X)) + geom_line()
}

1 New York Stock Exchange (NYSE) data (1962-1986) (140 pts)

Figure 1: Historical trading statistics from the New York Stock Exchange. Daily values of the normalized log trading volume, DJIA return, and log volatility are shown for a 24-year period from 1962-1986. We wish to predict trading volume on any day, given the history on all earlier days. To the left of the red bar (January 2, 1980) is training data, and to the right test data.

The NYSE.csv file contains three daily time series from the New York Stock Exchange (NYSE) for the period Dec 3, 1962-Dec 31, 1986 (6,051 trading days).

  • Log trading volume (\(v_t\)): This is the fraction of all outstanding shares that are traded on that day, relative to a 100-day moving average of past turnover, on the log scale.

  • Dow Jones return (\(r_t\)): This is the difference between the log of the Dow Jones Industrial Index on consecutive trading days.

  • Log volatility (\(z_t\)): This is based on the absolute values of daily price movements.

# Read in NYSE data from url

url = "https://raw.githubusercontent.com/ucla-biostat-212a/2024winter/master/slides/data/NYSE.csv"
NYSE <- read_csv(url)

NYSE
# A tibble: 6,051 × 6
   date       day_of_week DJ_return log_volume log_volatility train
   <date>     <chr>           <dbl>      <dbl>          <dbl> <lgl>
 1 1962-12-03 mon         -0.00446     0.0326           -13.1 TRUE 
 2 1962-12-04 tues         0.00781     0.346            -11.7 TRUE 
 3 1962-12-05 wed          0.00384     0.525            -11.7 TRUE 
 4 1962-12-06 thur        -0.00346     0.210            -11.6 TRUE 
 5 1962-12-07 fri          0.000568    0.0442           -11.7 TRUE 
 6 1962-12-10 mon         -0.0108      0.133            -10.9 TRUE 
 7 1962-12-11 tues         0.000124   -0.0115           -11.0 TRUE 
 8 1962-12-12 wed          0.00336     0.00161          -11.0 TRUE 
 9 1962-12-13 thur        -0.00330    -0.106            -11.0 TRUE 
10 1962-12-14 fri          0.00447    -0.138            -11.0 TRUE 
# ℹ 6,041 more rows

The autocorrelation at lag \(\ell\) is the correlation of all pairs \((v_t, v_{t-\ell})\) that are \(\ell\) trading days apart. These sizable correlations give us confidence that past values will be helpful in predicting the future.
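As a sanity check, the lag-\(\ell\) autocorrelation can be computed by hand and compared against R's `acf()`. This sketch uses a simulated AR(1) series (an illustration, not the NYSE data) and mirrors `acf()`'s convention of normalizing by the overall sample mean and lag-0 sum of squares.

```r
# Hand-computed lag-ell autocorrelation vs. stats::acf(), on a toy AR(1)
# series (not the NYSE data). acf() divides by the lag-0 sum of squares,
# so we do the same rather than calling cor() on the lagged pairs.
set.seed(1)
x <- as.numeric(arima.sim(model = list(ar = 0.7), n = 500))
ell <- 3
n <- length(x)
xbar <- mean(x)
manual <- sum((x[(ell + 1):n] - xbar) * (x[1:(n - ell)] - xbar)) /
  sum((x - xbar)^2)
builtin <- acf(x, plot = FALSE)$acf[ell + 1]  # entry 1 is lag 0
stopifnot(abs(manual - builtin) < 1e-8)
```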

ggacf(NYSE$log_volume) + ggthemes::theme_few()

Figure 2: The autocorrelation function for log volume. We see that nearby values are fairly strongly correlated, with correlations above 0.2 as far as 20 days apart.

Do a similar plot for (1) the correlation between \(v_t\) and the lag-\(\ell\) Dow Jones return \(r_{t-\ell}\) and (2) the correlation between \(v_t\) and the lag-\(\ell\) log volatility \(z_{t-\ell}\).

seq(1, 30) %>% 
  map(function(x) {cor(NYSE$log_volume, lag(NYSE$DJ_return, x), use = "pairwise.complete.obs")}) %>% 
  unlist() %>% 
  tibble(lag = 1:30, cor = .) %>% 
  ggplot(aes(x = lag, y = cor)) + 
  geom_hline(aes(yintercept = 0)) + 
  geom_segment(mapping = aes(xend = lag, yend = 0)) + 
  ggtitle("Cross-correlation between `log volume` and lagged `DJ return`")

Figure 3: Correlations between log_volume and lagged DJ_return.
seq(1, 30) %>% 
  map(function(x) {cor(NYSE$log_volume, lag(NYSE$log_volatility, x), use = "pairwise.complete.obs")}) %>% 
  unlist() %>% 
  tibble(lag = 1:30, cor = .) %>% 
  ggplot(aes(x = lag, y = cor)) + 
  geom_hline(aes(yintercept = 0)) + 
  geom_segment(mapping = aes(xend = lag, yend = 0)) + 
  ggtitle("Cross-correlation between `log volume` and lagged `log volatility`")

Figure 4: Weak correlations between log_volume and lagged log_volatility.

1.1 Project goal

Our goal is to forecast daily log trading volume using the various machine learning algorithms we learned in this class.

The data set is already split into train (before Jan 1st, 1980, \(n_{\text{train}} = 4,281\)) and test (after Jan 1st, 1980, \(n_{\text{test}} = 1,770\)) sets.

In general, we would tune the lag \(L\) to achieve the best forecasting performance. In this project, we fix \(L = 5\). That is, we always use the previous five trading days’ data to forecast today’s log trading volume.

Pay attention to the nuance of splitting time series data for cross validation. Study and use the time-series functionality in tidymodels. Make sure to use the same splits when tuning different machine learning algorithms.

Use the \(R^2\) between forecast and actual values as the cross validation and test evaluation criterion.
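For reference, yardstick’s `rsq_vec()` computes the squared Pearson correlation between truth and estimate. A minimal base-R equivalent, run on simulated data purely for illustration, is:

```r
# Base-R equivalent of yardstick::rsq_vec(): squared Pearson correlation
# between actual and forecast values. Simulated data for illustration only.
rsq_manual <- function(truth, estimate) cor(truth, estimate)^2
set.seed(212)
truth <- rnorm(100)
estimate <- 0.8 * truth + rnorm(100, sd = 0.3)
r2 <- rsq_manual(truth, estimate)
stopifnot(r2 > 0, r2 <= 1)
```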

1.2 Baseline method (20 pts)

We use the straw man forecaster (yesterday’s log trading volume as the prediction for today’s) as the baseline method. Evaluate the \(R^2\) of this method on the test data.

Before we tune different machine learning methods, let’s first separate the data into the test and non-test sets. We drop the first 5 trading days, which lack some lagged variables.

# Lag features: look back L trading days
L = 5

for(i in seq(1, L)) {
  NYSE <- NYSE %>% 
    mutate(!!paste("DJ_return_lag", i, sep = "") := lag(DJ_return, i),
           !!paste("log_volume_lag", i, sep = "") := lag(log_volume, i),
           !!paste("log_volatility_lag", i, sep = "") := lag(log_volatility, i))
}


# Drop the beginning trading days, which lack some lagged variables
NYSE <- NYSE %>% drop_na()

NYSE_other <- NYSE %>% 
  filter(train) %>%
  select(-train)
dim(NYSE_other)
[1] 4276   20
NYSE_test <- NYSE %>% 
  filter(!train) %>%
  select(-train)
dim(NYSE_test)
[1] 1770   20
# Straw man: use yesterday's log_volume as today's forecast
r2_test_strawman <- rsq_vec(NYSE_test$log_volume, lag(NYSE_test$log_volume, 1)) %>% round(2)
print(paste("Straw man test R2: ", r2_test_strawman))
[1] "Straw man test R2:  0.35"

1.3 Autoregression (AR) forecaster (30 pts)

  • Let \[ y = \begin{pmatrix} v_{L+1} \\ v_{L+2} \\ v_{L+3} \\ \vdots \\ v_T \end{pmatrix}, \quad M = \begin{pmatrix} 1 & v_L & v_{L-1} & \cdots & v_1 \\ 1 & v_{L+1} & v_{L} & \cdots & v_2 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & v_{T-1} & v_{T-2} & \cdots & v_{T-L} \end{pmatrix}. \]

  • Fit an ordinary least squares (OLS) regression of \(y\) on \(M\), giving \[ \hat v_t = \hat \beta_0 + \hat \beta_1 v_{t-1} + \hat \beta_2 v_{t-2} + \cdots + \hat \beta_L v_{t-L}, \] known as an order-\(L\) autoregression model or AR(\(L\)).

  • Before we start the model training, let’s talk about time series resampling. We will use the rolling_origin function in the rsample package to create a time series cross-validation plan.

  • When the data have a strong time component, a resampling method should support modeling to estimate seasonal and other temporal trends within the data. A technique that randomly samples values from the training set can disrupt the model’s ability to estimate these patterns.
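The OLS autoregression defined above can be sketched directly in base R: build the lag matrix \(M\) column by column and regress \(y\) on it with `lm()`. The simulated series `v` here is an illustration, not the NYSE data.

```r
# Base-R sketch of the AR(L) fit: lag matrix M plus lm(), on a simulated
# series (an assumption; the homework uses NYSE log_volume instead).
set.seed(7)
v <- as.numeric(arima.sim(model = list(ar = c(0.5, 0.2)), n = 300))
L <- 5
T_len <- length(v)
# Column l of M holds v_{t-l} for t = L+1, ..., T
M <- sapply(1:L, function(l) v[(L + 1 - l):(T_len - l)])
colnames(M) <- paste0("lag", 1:L)
y <- v[(L + 1):T_len]
ar_fit <- lm(y ~ M)
stopifnot(length(coef(ar_fit)) == L + 1)  # intercept + L lag coefficients
```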

NYSE %>% 
  ggplot(aes(x = date, y = log_volume)) + 
  geom_line() + 
  geom_smooth(method = "lm")

wrong_split <- initial_split(NYSE_other)

bind_rows(
  training(wrong_split) %>% mutate(type = "train"),
  testing(wrong_split) %>% mutate(type = "test")
) %>% 
  ggplot(aes(x = date, y = log_volume, color = type, group = NA)) + 
  geom_line()

correct_split <- initial_time_split(NYSE_other %>% arrange(date))

bind_rows(
  training(correct_split) %>% mutate(type = "train"),
  testing(correct_split) %>% mutate(type = "test")
) %>% 
  ggplot(aes(x = date, y = log_volume, color = type, group = NA)) + 
  geom_line()

rolling_origin(NYSE_other %>% arrange(date), initial = 30, assess = 7) %>%
#sliding_period(NYSE_other %>% arrange(date), date, period = "day", lookback = Inf, assess_stop = 1) %>% 
  mutate(train_data = map(splits, analysis),
         test_data = map(splits, assessment)) %>% 
  select(-splits) %>% 
  pivot_longer(-id) %>% 
  filter(id %in% c("Slice0001", "Slice0002", "Slice0003")) %>% 
  unnest(value) %>% 
  ggplot(aes(x = date, y = log_volume, color = name, group = NA)) + 
  geom_point() + 
  geom_line() +
  facet_wrap(~id, scales = "fixed")

sliding_period(NYSE_other %>% arrange(date), 
               date, period = "month", lookback = Inf, assess_stop = 1) %>% 
  mutate(train_data = map(splits, analysis),
         test_data = map(splits, assessment)) %>% 
  select(-splits) %>% 
  pivot_longer(-id) %>% 
  filter(id %in% c("Slice001", "Slice002", "Slice003")) %>% 
  unnest(value) %>% 
  ggplot(aes(x = date, y = log_volume, color = name, group = NA)) + 
  geom_point() +
  geom_line() + 
  facet_wrap(~id, scales = "fixed")

  • Rolling forecast origin resampling (Hyndman and Athanasopoulos 2018) provides a method that emulates how time series data is often partitioned in practice, estimating the model with historical data and evaluating it with the most recent data.

  • Tune AR(5) with elastic net (lasso + ridge) regularization using all 3 features on the training data, and evaluate the test performance.
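The expanding-window (“rolling origin”) scheme described above can be sketched in base R by listing the analysis/assessment indices each slice would use, assuming cumulative training windows as in `rsample::rolling_origin()`’s defaults.

```r
# Expanding-window resampling indices: each slice trains on everything up
# to the origin and assesses on the next `assess` points. Mirrors what
# rsample::rolling_origin(initial = 30, assess = 7) produces (assumption:
# default cumulative = TRUE, skip = 0).
n <- 100; initial <- 30; assess <- 7
origins <- seq(initial, n - assess)
splits <- lapply(origins, function(o)
  list(analysis = 1:o, assessment = (o + 1):(o + assess)))
# Training data always precedes assessment data -- no leakage
stopifnot(all(vapply(splits, function(s)
  max(s$analysis) < min(s$assessment), logical(1))))
```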

1.4 Preprocessing

en_receipe <- 
  recipe(log_volume ~ ., data = NYSE_other) %>% 
  step_dummy(all_nominal(), -all_outcomes()) %>% 
  step_normalize(all_numeric_predictors()) %>%  
  step_naomit(all_predictors()) %>%
  prep(training = NYSE_other)
en_receipe

1.5 Model training

### Model
enet_mod <- 
  # mixture = 0 (ridge), mixture = 1 (lasso)
  # mixture = (0, 1) elastic net 
  # As an example, we set mixture = 0.5. It needs to be tuned.
  linear_reg(penalty = tune(), mixture = 0.5) %>% 
  set_engine("glmnet")
enet_mod
Linear Regression Model Specification (regression)

Main Arguments:
  penalty = tune()
  mixture = 0.5

Computational engine: glmnet 
en_wf <- 
  workflow() %>%
  add_model(enet_mod) %>%
  add_recipe(en_receipe %>% step_rm(date) %>% step_indicate_na())
en_wf
══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: linear_reg()

── Preprocessor ────────────────────────────────────────────────────────────────
5 Recipe Steps

• step_dummy()
• step_normalize()
• step_naomit()
• step_rm()
• step_indicate_na()

── Model ───────────────────────────────────────────────────────────────────────
Linear Regression Model Specification (regression)

Main Arguments:
  penalty = tune()
  mixture = 0.5

Computational engine: glmnet 
folds <- NYSE_other %>% arrange(date) %>%
    sliding_period(date, period = "month", lookback = Inf, assess_stop = 1)
  # rolling_origin(initial = 5, assess = 1)
folds
# Sliding period resampling 
# A tibble: 204 × 2
   splits           id      
   <list>           <chr>   
 1 <split [15/22]>  Slice001
 2 <split [37/19]>  Slice002
 3 <split [56/21]>  Slice003
 4 <split [77/21]>  Slice004
 5 <split [98/22]>  Slice005
 6 <split [120/20]> Slice006
 7 <split [140/22]> Slice007
 8 <split [162/22]> Slice008
 9 <split [184/20]> Slice009
10 <split [204/23]> Slice010
# ℹ 194 more rows
month_folds <- NYSE_other %>%
  sliding_period(
    date,
    "month",
    lookback = Inf,
    skip = 4)
month_folds
# Sliding period resampling 
# A tibble: 200 × 2
   splits           id      
   <list>           <chr>   
 1 <split [98/22]>  Slice001
 2 <split [120/20]> Slice002
 3 <split [140/22]> Slice003
 4 <split [162/22]> Slice004
 5 <split [184/20]> Slice005
 6 <split [204/23]> Slice006
 7 <split [227/18]> Slice007
 8 <split [245/21]> Slice008
 9 <split [266/22]> Slice009
10 <split [288/19]> Slice010
# ℹ 190 more rows
lambda_grid <-
  grid_regular(penalty(range = c(-8, -7), trans = log10_trans()), levels = 3)
lambda_grid
# A tibble: 3 × 1
       penalty
         <dbl>
1 0.00000001  
2 0.0000000316
3 0.0000001   
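The three candidate penalties above are evenly spaced on the log10 scale; a base-R check of what `grid_regular(penalty(range = c(-8, -7)), levels = 3)` produces:

```r
# The penalty grid is regular on the log10 scale: 10^-8, 10^-7.5, 10^-7,
# matching the tibble printed above.
penalties <- 10^seq(-8, -7, length.out = 3)
stopifnot(isTRUE(all.equal(penalties, c(1e-8, 10^(-7.5), 1e-7))))
```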
en_fit <- tune_grid(en_wf, resamples = month_folds, grid = lambda_grid) 
en_fit
# Tuning results
# Sliding period resampling 
# A tibble: 200 × 4
   splits           id       .metrics         .notes          
   <list>           <chr>    <list>           <list>          
 1 <split [98/22]>  Slice001 <tibble [6 × 5]> <tibble [0 × 3]>
 2 <split [120/20]> Slice002 <tibble [6 × 5]> <tibble [0 × 3]>
 3 <split [140/22]> Slice003 <tibble [6 × 5]> <tibble [0 × 3]>
 4 <split [162/22]> Slice004 <tibble [6 × 5]> <tibble [0 × 3]>
 5 <split [184/20]> Slice005 <tibble [6 × 5]> <tibble [0 × 3]>
 6 <split [204/23]> Slice006 <tibble [6 × 5]> <tibble [0 × 3]>
 7 <split [227/18]> Slice007 <tibble [6 × 5]> <tibble [0 × 3]>
 8 <split [245/21]> Slice008 <tibble [6 × 5]> <tibble [0 × 3]>
 9 <split [266/22]> Slice009 <tibble [6 × 5]> <tibble [0 × 3]>
10 <split [288/19]> Slice010 <tibble [6 × 5]> <tibble [0 × 3]>
# ℹ 190 more rows
en_fit %>%
  collect_metrics() %>%
  print(width = Inf) %>%
  filter(.metric == "rsq") %>%
  ggplot(mapping = aes(x = penalty, y = mean)) + 
  geom_point() + 
  geom_line() + 
  labs(x = "Penalty", y = "CV R-squared") + 
  scale_x_log10(labels = scales::label_number())
# A tibble: 6 × 7
       penalty .metric .estimator  mean     n std_err .config             
         <dbl> <chr>   <chr>      <dbl> <int>   <dbl> <chr>               
1 0.00000001   rmse    standard   0.133   200 0.00287 Preprocessor1_Model1
2 0.00000001   rsq     standard   0.381   200 0.0149  Preprocessor1_Model1
3 0.0000000316 rmse    standard   0.133   200 0.00287 Preprocessor1_Model2
4 0.0000000316 rsq     standard   0.381   200 0.0149  Preprocessor1_Model2
5 0.0000001    rmse    standard   0.133   200 0.00287 Preprocessor1_Model3
6 0.0000001    rsq     standard   0.381   200 0.0149  Preprocessor1_Model3

en_fit %>%
  show_best("rsq")
# A tibble: 3 × 7
       penalty .metric .estimator  mean     n std_err .config             
         <dbl> <chr>   <chr>      <dbl> <int>   <dbl> <chr>               
1 0.00000001   rsq     standard   0.381   200  0.0149 Preprocessor1_Model1
2 0.0000000316 rsq     standard   0.381   200  0.0149 Preprocessor1_Model2
3 0.0000001    rsq     standard   0.381   200  0.0149 Preprocessor1_Model3
best_en_fit <- en_fit %>%
  select_best("rsq")
best_en_fit
# A tibble: 1 × 2
     penalty .config             
       <dbl> <chr>               
1 0.00000001 Preprocessor1_Model1
# Final workflow
final_wf_en <- en_wf %>%
  finalize_workflow(best_en_fit)
final_wf_en
══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: linear_reg()

── Preprocessor ────────────────────────────────────────────────────────────────
5 Recipe Steps

• step_dummy()
• step_normalize()
• step_naomit()
• step_rm()
• step_indicate_na()

── Model ───────────────────────────────────────────────────────────────────────
Linear Regression Model Specification (regression)

Main Arguments:
  penalty = 1e-08
  mixture = 0.5

Computational engine: glmnet 
# Fit the whole training set, then predict the test cases
final_fit_en <- 
  final_wf_en %>%
  last_fit(correct_split)
final_fit_en
# Resampling results
# Manual resampling 
# A tibble: 1 × 6
  splits              id               .metrics .notes   .predictions .workflow 
  <list>              <chr>            <list>   <list>   <list>       <list>    
1 <split [3207/1069]> train/test split <tibble> <tibble> <tibble>     <workflow>
# Test metrics
final_fit_en %>% collect_metrics()
# A tibble: 2 × 4
  .metric .estimator .estimate .config             
  <chr>   <chr>          <dbl> <chr>               
1 rmse    standard       0.138 Preprocessor1_Model1
2 rsq     standard       0.688 Preprocessor1_Model1
predictions <- final_fit_en %>%
  collect_predictions()
predictions
# A tibble: 1,069 × 5
   id                 .pred  .row log_volume .config             
   <chr>              <dbl> <int>      <dbl> <chr>               
 1 train/test split -0.135   3208     0.0353 Preprocessor1_Model1
 2 train/test split -0.0388  3209     0.0338 Preprocessor1_Model1
 3 train/test split -0.106   3210    -0.141  Preprocessor1_Model1
 4 train/test split -0.0471  3211    -0.352  Preprocessor1_Model1
 5 train/test split -0.160   3212     0.154  Preprocessor1_Model1
 6 train/test split  0.0224  3213    -0.167  Preprocessor1_Model1
 7 train/test split -0.116   3214     0.102  Preprocessor1_Model1
 8 train/test split -0.0778  3215    -0.0838 Preprocessor1_Model1
 9 train/test split -0.0669  3216    -0.247  Preprocessor1_Model1
10 train/test split -0.0560  3217     0.205  Preprocessor1_Model1
# ℹ 1,059 more rows

1.6 Random forest forecaster (30pts)

  • Use the same features as in AR(\(L\)) for the random forest. Tune the random forest and evaluate the test performance.

  • Hint: Workflow: Random Forest for Prediction is a good starting point.

rf_recipe <- recipe(log_volume ~ ., data = NYSE_other) %>% 
  step_dummy(all_nominal(), -all_outcomes()) %>% 
  step_normalize(all_numeric_predictors()) %>%  
  step_naomit(all_predictors()) %>%
  prep(training = NYSE_other)
rf_recipe
rf_mod <- 
  rand_forest(
    mode = "regression",
    # Number of predictors randomly sampled in each split
    mtry = tune(),
    # Number of trees in ensemble
    trees = tune()
  ) %>% 
  set_engine("ranger")
rf_mod
Random Forest Model Specification (regression)

Main Arguments:
  mtry = tune()
  trees = tune()

Computational engine: ranger 
rf_wf <- workflow() %>%
  add_recipe(rf_recipe %>% step_rm(date) %>% step_indicate_na()) %>%
  add_model(rf_mod)
rf_wf
══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: rand_forest()

── Preprocessor ────────────────────────────────────────────────────────────────
5 Recipe Steps

• step_dummy()
• step_normalize()
• step_naomit()
• step_rm()
• step_indicate_na()

── Model ───────────────────────────────────────────────────────────────────────
Random Forest Model Specification (regression)

Main Arguments:
  mtry = tune()
  trees = tune()

Computational engine: ranger 
folds <- NYSE_other %>% arrange(date) %>%
    sliding_period(date, period = "month", lookback = Inf, assess_stop = 1)
  # rolling_origin(initial = 5, assess = 1)
folds
# Sliding period resampling 
# A tibble: 204 × 2
   splits           id      
   <list>           <chr>   
 1 <split [15/22]>  Slice001
 2 <split [37/19]>  Slice002
 3 <split [56/21]>  Slice003
 4 <split [77/21]>  Slice004
 5 <split [98/22]>  Slice005
 6 <split [120/20]> Slice006
 7 <split [140/22]> Slice007
 8 <split [162/22]> Slice008
 9 <split [184/20]> Slice009
10 <split [204/23]> Slice010
# ℹ 194 more rows
month_folds <- NYSE_other %>%
  sliding_period(
    date,
    "month",
    lookback = Inf,
    skip = 4)
month_folds
# Sliding period resampling 
# A tibble: 200 × 2
   splits           id      
   <list>           <chr>   
 1 <split [98/22]>  Slice001
 2 <split [120/20]> Slice002
 3 <split [140/22]> Slice003
 4 <split [162/22]> Slice004
 5 <split [184/20]> Slice005
 6 <split [204/23]> Slice006
 7 <split [227/18]> Slice007
 8 <split [245/21]> Slice008
 9 <split [266/22]> Slice009
10 <split [288/19]> Slice010
# ℹ 190 more rows
rf_grid <- grid_regular(
  trees(range = c(100L, 300L)), 
  mtry(range = c(1L, 5L)),
  levels = c(3, 5)
  )
rf_grid
# A tibble: 15 × 2
   trees  mtry
   <int> <int>
 1   100     1
 2   200     1
 3   300     1
 4   100     2
 5   200     2
 6   300     2
 7   100     3
 8   200     3
 9   300     3
10   100     4
11   200     4
12   300     4
13   100     5
14   200     5
15   300     5
rf_fit <- tune_grid(
  rf_wf,
  resamples = month_folds,
  grid = rf_grid,
 metrics = metric_set(rmse, rsq)  
)
rf_fit
# Tuning results
# Sliding period resampling 
# A tibble: 200 × 4
   splits           id       .metrics          .notes          
   <list>           <chr>    <list>            <list>          
 1 <split [98/22]>  Slice001 <tibble [30 × 6]> <tibble [0 × 3]>
 2 <split [120/20]> Slice002 <tibble [30 × 6]> <tibble [0 × 3]>
 3 <split [140/22]> Slice003 <tibble [30 × 6]> <tibble [0 × 3]>
 4 <split [162/22]> Slice004 <tibble [30 × 6]> <tibble [0 × 3]>
 5 <split [184/20]> Slice005 <tibble [30 × 6]> <tibble [0 × 3]>
 6 <split [204/23]> Slice006 <tibble [30 × 6]> <tibble [0 × 3]>
 7 <split [227/18]> Slice007 <tibble [30 × 6]> <tibble [0 × 3]>
 8 <split [245/21]> Slice008 <tibble [30 × 6]> <tibble [0 × 3]>
 9 <split [266/22]> Slice009 <tibble [30 × 6]> <tibble [0 × 3]>
10 <split [288/19]> Slice010 <tibble [30 × 6]> <tibble [0 × 3]>
# ℹ 190 more rows
rf_fit %>%
  collect_metrics() %>%
  print(width = Inf) %>%
  filter(.metric == "rsq") %>%
  mutate(mtry = as.factor(mtry)) %>%
  ggplot(mapping = aes(x = trees, y = mean, color = mtry)) +
  # geom_point() + 
  geom_line() + 
  labs(x = "Num. of Trees", y = "CV R-squared")
# A tibble: 30 × 8
    mtry trees .metric .estimator  mean     n std_err .config              
   <int> <int> <chr>   <chr>      <dbl> <int>   <dbl> <chr>                
 1     1   100 rmse    standard   0.161   200 0.00385 Preprocessor1_Model01
 2     1   100 rsq     standard   0.312   200 0.0142  Preprocessor1_Model01
 3     1   200 rmse    standard   0.161   200 0.00388 Preprocessor1_Model02
 4     1   200 rsq     standard   0.312   200 0.0141  Preprocessor1_Model02
 5     1   300 rmse    standard   0.161   200 0.00389 Preprocessor1_Model03
 6     1   300 rsq     standard   0.318   200 0.0143  Preprocessor1_Model03
 7     2   100 rmse    standard   0.148   200 0.00330 Preprocessor1_Model04
 8     2   100 rsq     standard   0.322   200 0.0145  Preprocessor1_Model04
 9     2   200 rmse    standard   0.147   200 0.00334 Preprocessor1_Model05
10     2   200 rsq     standard   0.331   200 0.0145  Preprocessor1_Model05
# ℹ 20 more rows

# Final workflow
rf_fit %>%
  show_best("rsq")
# A tibble: 5 × 8
   mtry trees .metric .estimator  mean     n std_err .config              
  <int> <int> <chr>   <chr>      <dbl> <int>   <dbl> <chr>                
1     5   300 rsq     standard   0.340   200  0.0146 Preprocessor1_Model15
2     4   300 rsq     standard   0.339   200  0.0146 Preprocessor1_Model12
3     3   200 rsq     standard   0.336   200  0.0147 Preprocessor1_Model08
4     5   100 rsq     standard   0.335   200  0.0149 Preprocessor1_Model13
5     5   200 rsq     standard   0.335   200  0.0146 Preprocessor1_Model14
best_rf <- rf_fit %>%
  select_best("rsq")
best_rf
# A tibble: 1 × 3
   mtry trees .config              
  <int> <int> <chr>                
1     5   300 Preprocessor1_Model15
# Final workflow
final_wf_rf <- rf_wf %>%
  finalize_workflow(best_rf)
final_wf_rf
══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: rand_forest()

── Preprocessor ────────────────────────────────────────────────────────────────
5 Recipe Steps

• step_dummy()
• step_normalize()
• step_naomit()
• step_rm()
• step_indicate_na()

── Model ───────────────────────────────────────────────────────────────────────
Random Forest Model Specification (regression)

Main Arguments:
  mtry = 5
  trees = 300

Computational engine: ranger 
# Fit the whole training set, then predict the test cases
final_fit_rf <- 
  final_wf_rf %>%
  last_fit(correct_split)
final_fit_rf
# Resampling results
# Manual resampling 
# A tibble: 1 × 6
  splits              id               .metrics .notes   .predictions .workflow 
  <list>              <chr>            <list>   <list>   <list>       <list>    
1 <split [3207/1069]> train/test split <tibble> <tibble> <tibble>     <workflow>
# Test metrics
final_fit_rf %>% 
  collect_metrics()
# A tibble: 2 × 4
  .metric .estimator .estimate .config             
  <chr>   <chr>          <dbl> <chr>               
1 rmse    standard       0.150 Preprocessor1_Model1
2 rsq     standard       0.652 Preprocessor1_Model1

1.7 Boosting forecaster (30pts)

  • Use the same features as in AR(\(L\)) for the boosting. Tune the boosting algorithm and evaluate the test performance.

  • Hint: Workflow: Boosting tree for Prediction is a good starting point.

boost_recipe <- recipe(log_volume ~ ., data = NYSE_other) %>% 
  step_dummy(all_nominal(), -all_outcomes()) %>% 
  step_normalize(all_numeric_predictors()) %>%  
  step_naomit(all_predictors()) %>%
  prep(training = NYSE_other)
boost_recipe
gb_mod <- 
  boost_tree(
    mode = "regression",
    trees = 1000, 
    tree_depth = tune(),
    learn_rate = tune()
  ) %>% 
  set_engine("xgboost")
gb_mod
Boosted Tree Model Specification (regression)

Main Arguments:
  trees = 1000
  tree_depth = tune()
  learn_rate = tune()

Computational engine: xgboost 
boost_workflow <- workflow() %>%
  add_model(gb_mod) %>%
  add_recipe(boost_recipe %>% step_rm(date) %>% step_indicate_na())
boost_workflow
══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: boost_tree()

── Preprocessor ────────────────────────────────────────────────────────────────
5 Recipe Steps

• step_dummy()
• step_normalize()
• step_naomit()
• step_rm()
• step_indicate_na()

── Model ───────────────────────────────────────────────────────────────────────
Boosted Tree Model Specification (regression)

Main Arguments:
  trees = 1000
  tree_depth = tune()
  learn_rate = tune()

Computational engine: xgboost 
folds <- NYSE_other %>% arrange(date) %>%
    sliding_period(date, period = "month", lookback = Inf, assess_stop = 1)
  # rolling_origin(initial = 5, assess = 1)


month_folds <- NYSE_other %>%
  sliding_period(
    date,
    "month",
    lookback = Inf,
    skip = 4)
boost_grid <- grid_regular(
  tree_depth(range = c(1L, 2L)), 
  learn_rate(range = c(-3, -2), trans = log10_trans()), 
  levels = c(2, 3) 
)
boost_grid
# A tibble: 6 × 2
  tree_depth learn_rate
       <int>      <dbl>
1          1    0.001  
2          2    0.001  
3          1    0.00316
4          2    0.00316
5          1    0.01   
6          2    0.01   
boost_fit <- tune_grid(
  boost_workflow,
  resamples = month_folds,
  grid = boost_grid,
  metrics = metric_set(rmse, rsq)
)
boost_fit
# Tuning results
# Sliding period resampling 
# A tibble: 200 × 4
   splits           id       .metrics          .notes          
   <list>           <chr>    <list>            <list>          
 1 <split [98/22]>  Slice001 <tibble [12 × 6]> <tibble [0 × 3]>
 2 <split [120/20]> Slice002 <tibble [12 × 6]> <tibble [0 × 3]>
 3 <split [140/22]> Slice003 <tibble [12 × 6]> <tibble [0 × 3]>
 4 <split [162/22]> Slice004 <tibble [12 × 6]> <tibble [0 × 3]>
 5 <split [184/20]> Slice005 <tibble [12 × 6]> <tibble [0 × 3]>
 6 <split [204/23]> Slice006 <tibble [12 × 6]> <tibble [0 × 3]>
 7 <split [227/18]> Slice007 <tibble [12 × 6]> <tibble [0 × 3]>
 8 <split [245/21]> Slice008 <tibble [12 × 6]> <tibble [0 × 3]>
 9 <split [266/22]> Slice009 <tibble [12 × 6]> <tibble [0 × 3]>
10 <split [288/19]> Slice010 <tibble [12 × 6]> <tibble [0 × 3]>
# ℹ 190 more rows
boost_fit %>%
  collect_metrics() %>%
  print(width = Inf) %>%
  filter(.metric == "rsq") %>%
  ggplot(mapping = aes(x = learn_rate, y = mean, color = factor(tree_depth))) +
  geom_point() +
  geom_line() +
  labs(x = "Learning Rate", y = "CV R-squared") +
  scale_x_log10()
# A tibble: 12 × 8
   tree_depth learn_rate .metric .estimator  mean     n std_err
        <int>      <dbl> <chr>   <chr>      <dbl> <int>   <dbl>
 1          1    0.001   rmse    standard   0.257   200 0.00562
 2          1    0.001   rsq     standard   0.203   200 0.0120 
 3          2    0.001   rmse    standard   0.251   200 0.00502
 4          2    0.001   rsq     standard   0.239   200 0.0126 
 5          1    0.00316 rmse    standard   0.160   200 0.00356
 6          1    0.00316 rsq     standard   0.253   200 0.0129 
 7          2    0.00316 rmse    standard   0.150   200 0.00313
 8          2    0.00316 rsq     standard   0.303   200 0.0141 
 9          1    0.01    rmse    standard   0.144   200 0.00307
10          1    0.01    rsq     standard   0.322   200 0.0143 
11          2    0.01    rmse    standard   0.136   200 0.00283
12          2    0.01    rsq     standard   0.369   200 0.0153 
   .config             
   <chr>               
 1 Preprocessor1_Model1
 2 Preprocessor1_Model1
 3 Preprocessor1_Model2
 4 Preprocessor1_Model2
 5 Preprocessor1_Model3
 6 Preprocessor1_Model3
 7 Preprocessor1_Model4
 8 Preprocessor1_Model4
 9 Preprocessor1_Model5
10 Preprocessor1_Model5
11 Preprocessor1_Model6
12 Preprocessor1_Model6

boost_fit %>%
  show_best("rsq")
# A tibble: 5 × 8
  tree_depth learn_rate .metric .estimator  mean     n std_err .config          
       <int>      <dbl> <chr>   <chr>      <dbl> <int>   <dbl> <chr>            
1          2    0.01    rsq     standard   0.369   200  0.0153 Preprocessor1_Mo…
2          1    0.01    rsq     standard   0.322   200  0.0143 Preprocessor1_Mo…
3          2    0.00316 rsq     standard   0.303   200  0.0141 Preprocessor1_Mo…
4          1    0.00316 rsq     standard   0.253   200  0.0129 Preprocessor1_Mo…
5          2    0.001   rsq     standard   0.239   200  0.0126 Preprocessor1_Mo…
best_gb <- boost_fit %>%
  select_best("rsq")
best_gb
# A tibble: 1 × 3
  tree_depth learn_rate .config             
       <int>      <dbl> <chr>               
1          2       0.01 Preprocessor1_Model6
# Final workflow
final_wf_boost <- boost_workflow %>%
  finalize_workflow(best_gb)
final_wf_boost
══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: boost_tree()

── Preprocessor ────────────────────────────────────────────────────────────────
5 Recipe Steps

• step_dummy()
• step_normalize()
• step_naomit()
• step_rm()
• step_indicate_na()

── Model ───────────────────────────────────────────────────────────────────────
Boosted Tree Model Specification (regression)

Main Arguments:
  trees = 1000
  tree_depth = 2
  learn_rate = 0.01

Computational engine: xgboost 
# Fit the whole training set, then predict the test cases
final_fit_boost <- 
  final_wf_boost %>%
  last_fit(correct_split)
final_fit_boost
# Resampling results
# Manual resampling 
# A tibble: 1 × 6
  splits              id               .metrics .notes   .predictions .workflow 
  <list>              <chr>            <list>   <list>   <list>       <list>    
1 <split [3207/1069]> train/test split <tibble> <tibble> <tibble>     <workflow>
# Test metrics
final_fit_boost %>% 
  collect_metrics()
# A tibble: 2 × 4
  .metric .estimator .estimate .config             
  <chr>   <chr>          <dbl> <chr>               
1 rmse    standard       0.147 Preprocessor1_Model1
2 rsq     standard       0.665 Preprocessor1_Model1

1.8 Summary (30pts)

Your score for this question is largely determined by your final test performance.

Summarize the performance of different machine learning forecasters in the following format.

The performance of the machine learning forecasters on predicting daily log trading volume, using New York Stock Exchange data from 1962 to 1986 with training data up to January 2, 1980, is summarized below. Model performance was evaluated by the coefficient of determination \(R^2\), both under cross-validation (CV) and on the unseen test data. \(R^2\) measures the proportion of variance in the response that is predictable from the features, with values closer to 1 indicating better performance.

| Method | CV \(R^2\) | Test \(R^2\) |
|:------:|:----------:|:------------:|
| Baseline | - | 0.35 |
| AR(5) | 0.38 | 0.69 |
| Random Forest | 0.34 | 0.65 |
| Boosting | 0.37 | 0.67 |

Baseline (Straw man): The baseline model, which uses the previous day’s log trading volume as the prediction for the current day, achieved a test \(R^2\) of 0.35. This simple model serves as the point of comparison for the more complex models.

AR(5): An autoregressive model of order 5 (AR(5)), which uses the past 5 days of log trading volume to predict the current day’s volume, showed an improvement over the baseline, with a CV R^2 of 0.38 and a test R^2 of 0.69. This indicates that incorporating more historical data points helps in improving the prediction accuracy.

Random Forest: The Random Forest model, which is a non-linear and more complex model compared to AR(5), achieved a slightly lower CV R^2 of 0.34 but also a lower test R^2 of 0.65. This suggests that despite its ability to capture complex patterns in the data, it might be overfitting to the training data or not capturing the time series dynamics as effectively as linear models in this specific case.

Boosting: The Boosting model, another sophisticated and flexible ensemble method, showed a CV R^2 of 0.37 and a test R^2 of 0.67. It performed better than the Random Forest model but was still slightly behind the AR(5) model in terms of test performance. Boosting’s iterative approach to correcting errors might be beneficial for this time series forecasting task, although it didn’t outperform the simpler AR(5) model.

Conclusion: The AR(5) model, despite its simplicity, provided the best performance on the test data among the models evaluated. This suggests that for the task of forecasting daily log trading volume, a linear approach that directly incorporates recent historical values can be very effective. Meanwhile, the more complex Random Forest and Boosting models did not achieve higher performance, highlighting the importance of model selection and the possibility that simpler models may suffice for certain forecasting tasks.

Extension reading
- MOIRAI: Salesforce’s Foundation Model for Time-Series Forecasting

2 ISL Exercise 12.6.13 (90 pts)

  On the book website, www.statlearning.com, there is a gene expression data set (Ch12Ex13.csv) that consists of 40 tissue samples with measurements on 1,000 genes. The first 20 samples are from healthy patients, while the second 20 are from a diseased group.
  a. Load in the data using read.csv(). You will need to select header = F.
  b. Apply hierarchical clustering to the samples using correlation-based distance, and plot the dendrogram. Do the genes separate the samples into the two groups? Do your results depend on the type of linkage used?
  c. Your collaborator wants to know which genes differ the most across the two groups. Suggest a way to answer this question, and apply it here.

Import the data

library(tidyverse)
library(cluster)
library(ggplot2)
# Load the data
gene_express <- read.csv('../hw6/Ch12Ex13.csv', header = FALSE)

# Few top rows of the dataset:
head(gene_express)
           V1         V2         V3         V4         V5         V6
1 -0.96193340  0.4418028 -0.9750051  1.4175040  0.8188148  0.3162937
2 -0.29252570 -1.1392670  0.1958370 -1.2811210 -0.2514393  2.5119970
3  0.25878820 -0.9728448  0.5884858 -0.8002581 -1.8203980 -2.0589240
4 -1.15213200 -2.2131680 -0.8615249  0.6309253  0.9517719 -1.1657240
5  0.19578280  0.5933059  0.2829921  0.2471472  1.9786680 -0.8710180
6  0.03012394 -0.6910143 -0.4034258 -0.7298590 -0.3640986  1.1253490
           V7          V8          V9        V10        V11        V12
1 -0.02496682 -0.06396600  0.03149702 -0.3503106 -0.7227299 -0.2819547
2 -0.92220620  0.05954277 -1.40964500 -0.6567122 -0.1157652  0.8259783
3 -0.06476437  1.59212400 -0.17311700 -0.1210874 -0.1875790 -1.5001630
4 -0.39155860  1.06361900 -0.35000900 -1.4890580 -0.2432189 -0.4330340
5 -0.98971500 -1.03225300 -1.10965400 -0.3851423  1.6509570 -1.7449090
6 -1.40404100 -0.80613040 -1.23792400  0.5776018 -0.2720642  2.1765620
          V13         V14        V15        V16        V17        V18
1  1.33751500  0.70197980  1.0076160 -0.4653828  0.6385951  0.2867807
2  0.34644960 -0.56954860 -0.1315365  0.6902290 -0.9090382  1.3026420
3 -1.22873700  0.85598900  1.2498550 -0.8980815  0.8702058 -0.2252529
4 -0.03879128 -0.05789677 -1.3977620 -0.1561871 -2.7359820  0.7756169
5 -0.37888530 -0.67982610 -2.1315840 -0.2301718  0.4661243 -1.8004490
6  1.43640700 -1.02578100  0.2981582 -0.5559659  0.2046529 -1.1916480
         V19         V20        V21        V22        V23        V24
1 -0.2270782 -0.22004520 -1.2425730 -0.1085056 -1.8642620 -0.5005122
2 -1.6726950 -0.52550400  0.7979700 -0.6897930  0.8995305  0.4285812
3  0.4502892  0.55144040  0.1462943  0.1297400  1.3042290 -1.6619080
4  0.6141562  2.01919400  1.0811390 -1.0766180 -0.2434181  0.5134822
5  0.6262904 -0.09772305 -0.2997108 -0.5295591 -2.0235670 -0.5108402
6  0.2350916  0.67096470  0.1307988  1.0689940  1.2309870  1.1344690
          V25         V26        V27        V28         V29         V30
1 -1.32500800  1.06341100 -0.2963712 -0.1216457  0.08516605  0.62417640
2 -0.67611410 -0.53409490 -1.7325070 -1.6034470 -1.08362000  0.03342185
3 -1.63037600 -0.07742528  1.3061820  0.7926002  1.55946500 -0.68851160
4 -0.51285780  2.55167600 -2.3143010 -1.2764700 -1.22927100  1.43439600
5  0.04600274  1.26803000 -0.7439868  0.2231319  0.85846280  0.27472610
6  0.55636800 -0.35876640  1.0798650 -0.2064905 -0.00616453  0.16425470
         V31          V32         V33        V34        V35        V36
1 -0.5095915 -0.216725500 -0.05550597 -0.4844491 -0.5215811  1.9491350
2  1.7007080  0.007289556  0.09906234  0.5638533 -0.2572752 -0.5817805
3 -0.6154720  0.009999363  0.94581000 -0.3185212 -0.1178895  0.6213662
4 -0.2842774  0.198945600 -0.09183320  0.3496279 -0.2989097  1.5136960
5 -0.6929984 -0.845707200 -0.17749680 -0.1664908  1.4831550 -1.6879460
6  1.1567370  0.241774500  0.08863952  0.1829540  0.9426771 -0.2096004
          V37        V38         V39        V40
1  1.32433500  0.4681471  1.06110000  1.6559700
2 -0.16988710 -0.5423036  0.31293890 -1.2843770
3 -0.07076396  0.4016818 -0.01622713 -0.5265532
4  0.67118470  0.0108553 -1.04368900  1.6252750
5 -0.14142960  0.2007785 -0.67594210  2.2206110
6  0.53626210 -1.1852260 -0.42274760  0.6243603

2.1 12.6.13 (b) (30 pts)

There are several measures of "dissimilarity" that can be explored when building the distance matrix; here we use a correlation-based distance.

I define dissimilarity = 1 - correlation. The as.dist() function converts an arbitrary square symmetric matrix into a form that hclust() recognizes as a distance matrix, which lets us treat these correlation-based values as distances. The hclust() function implements hierarchical clustering in R.

We now plot the hierarchical clustering dendrogram using complete, single, and average linkage clustering using correlation based distance.

dd <- as.dist(1 - cor(gene_express))

# Plot the dendrograms using the usual plot() function.
# The numbers at the bottom of the plot identify each observation.

plot(hclust(dd, method = "complete"),
     main = "Complete Linkage with Correlation-Based Distance",
     xlab = "", sub = "")

plot(hclust(dd, method = "average"),
     main = "Average Linkage with Correlation-Based Distance",
     xlab = "", sub = "")

plot(hclust(dd, method = "single"),
     main = "Single Linkage with Correlation-Based Distance",
     xlab = "", sub = "")

Inference:

The choice of linkage certainly affects the results. We obtain two clusters with complete and single linkage, and three clusters with average linkage. Typically, single linkage tends to yield trailing clusters: very large clusters onto which individual observations attach one by one. Complete and average linkage, on the other hand, tend to yield more balanced clusters, and for this reason they are generally preferred to single linkage.
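The visual impression from the dendrograms can be checked numerically with `cutree()`, which cuts each tree into two clusters so the assignments can be cross-tabulated against the known group labels (a sketch; it assumes the distance matrix `dd` computed above and that columns 1-20 are healthy and 21-40 diseased):

```r
true_group <- rep(c("Healthy", "Diseased"), each = 20)

# Cut each dendrogram into k = 2 clusters and compare to the true labels
for (link in c("complete", "average", "single")) {
  clusters <- cutree(hclust(dd, method = link), k = 2)
  cat("\nLinkage:", link, "\n")
  print(table(clusters, true_group))
}
```

A perfect separation would put all 20 healthy samples in one cluster and all 20 diseased samples in the other; off-diagonal counts show how each linkage misassigns samples.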

2.2 PCA and UMAP (30 pts)

library(FactoMineR)
library(factoextra)

# Applying PCA
pca_res <- prcomp(t(gene_express), scale. = TRUE)

# Plotting PCA results
fviz_pca_ind(pca_res,
             col.ind = rep(c("Healthy", "Diseased"), each = 20), # Color by group
             legend.title = "Group",
             title = "PCA of Gene Expression Data")

UMAP

library(umap)
# Preparing data for UMAP
data_for_umap <- t(gene_express) # umap() expects observations (samples) in rows, so transpose

# Applying UMAP
umap_res <- umap(data_for_umap)

# Plotting UMAP results
df_umap <- as.data.frame(umap_res$layout)
df_umap$Group <- rep(c("Healthy", "Diseased"), each = 20)
ggplot(df_umap, aes(V1, V2, color = Group)) +
  geom_point() +
  theme_minimal() +
  labs(title = "UMAP of Gene Expression Data",
       x = "UMAP 1", y = "UMAP 2", color = "Group")

2.3 12.6.13 (c) (30 pts)

To analyze the gene expression data and identify which genes exhibit the most significant differences between two groups (healthy and diseased), we will employ the prcomp() function for Principal Component Analysis (PCA). This technique reduces the dimensionality while preserving as much of the variation as possible. By setting scale = TRUE, we ensure that each variable is not only centered to have a mean of zero but is also scaled to have a unit standard deviation. This standardization is crucial for comparing features that may have different units or scales of measurement.

The output of prcomp(), specifically the rotation component, contains the loadings of the principal components. These loadings can be interpreted as the contribution or weight of each gene to the principal components. Essentially, they help in understanding how strongly each gene is associated with the different dimensions (principal components) extracted during PCA. In this context, examining the loadings can highlight the genes that differ most significantly between the healthy and diseased groups, as indicated by their contribution to the principal axes of variation within the data.

pr.out <- prcomp(t(gene_express), scale = TRUE)
head(pr.out$rotation)
              PC1         PC2          PC3          PC4          PC5
[1,] -0.005248643  0.02924652 -0.003735358 -0.016526153  0.006318053
[2,] -0.002162728  0.04373711 -0.071475379  0.036521467 -0.010739305
[3,]  0.014444163  0.01361593 -0.011068681 -0.075011306  0.002396073
[4,]  0.014868417 -0.01975169 -0.038690207  0.018872555 -0.007595757
[5,]  0.010600362 -0.02949636  0.016983813  0.005730388 -0.019518632
[6,]  0.030133863  0.03181327  0.016420667  0.016799619  0.037829117
              PC6         PC7           PC8          PC9         PC10
[1,] -0.049838208 -0.01283006  0.0215726238  0.007640568 -0.042478511
[2,]  0.044918065  0.02817240 -0.0372224597 -0.030437604  0.077891794
[3,] -0.073571187 -0.03590103  0.0385994243  0.013067476  0.004386205
[4,] -0.006314036  0.04078925  0.0032669777  0.013650293 -0.003238640
[5,]  0.025142879  0.03502170  0.0357811899  0.036519007 -0.078427052
[6,]  0.076979558  0.01745551  0.0002496599 -0.037883486  0.001399260
            PC11         PC12         PC13          PC14         PC15
[1,] -0.06799323 -0.066325536  0.049657614 -0.0105490714  0.002388065
[2,]  0.03504599  0.007576127 -0.013509593 -0.0120626555  0.017620628
[3,]  0.02675494 -0.004421772 -0.023358127 -0.0002795565  0.005085890
[4,] -0.04521873 -0.084699633 -0.002718848 -0.0170544115 -0.047018603
[5,] -0.06127074 -0.004607158  0.021907325 -0.0174513795 -0.014291851
[6,]  0.06333435  0.001071488  0.031586824  0.0091279070  0.012496721
             PC16         PC17          PC18        PC19         PC20
[1,]  0.057226217  0.005064164 -0.0368384794 -0.01386445  0.008475267
[2,] -0.038753365  0.018177190  0.0065293541  0.02616184 -0.008769863
[3,] -0.008360650 -0.033206319  0.0036294280 -0.02014200 -0.024457400
[4,] -0.027114939 -0.037841235  0.0090504436  0.03994846  0.014224876
[5,] -0.010468740 -0.027149195  0.0487582488  0.02195479  0.060265982
[6,]  0.004923492 -0.003525826  0.0006219705  0.03129817  0.023261190
              PC21         PC22        PC23         PC24         PC25
[1,]  0.0194249905  0.000140247 -0.02989553 -0.076029645 -0.004324215
[2,] -0.0078725765 -0.026677304 -0.03100535 -0.002287257 -0.021849136
[3,] -0.0093182521  0.024043186  0.05819660  0.018947343  0.014286045
[4,]  0.0554784102  0.059728272 -0.04416882  0.030836222  0.002601610
[5,] -0.0000414575 -0.039202866  0.01982787 -0.037371354  0.003398134
[6,] -0.0105442468  0.006799149 -0.06354254  0.019801778  0.007791356
             PC26         PC27        PC28         PC29          PC30
[1,] -0.005802548  0.053495473  0.03324798  0.039595474  0.0008190378
[2,] -0.018643636  0.051218992  0.01319940 -0.005557906 -0.0085837033
[3,] -0.012192009 -0.083097382  0.02438412  0.012882579 -0.0032415950
[4,]  0.018058462  0.041462682  0.05223420  0.003324611  0.0449676174
[5,] -0.027906713  0.014694086 -0.01277924  0.001121646  0.0203340545
[6,] -0.053747211  0.008386746 -0.01655763 -0.005120959 -0.0673679957
            PC31          PC32         PC33         PC34         PC35
[1,]  0.04451014  0.0001689378 -0.025423202 -0.010178476  0.021477278
[2,]  0.06072757 -0.0089395354  0.042380402  0.005784563  0.002459284
[3,] -0.02798607 -0.0406599481 -0.004561497  0.077879450 -0.005489991
[4,]  0.02768043 -0.0086310349  0.014854858 -0.006997894 -0.023272536
[5,] -0.06631350 -0.0079579383  0.033064563 -0.017950552 -0.050971271
[6,] -0.01807539 -0.0222340328  0.016882833  0.005770649  0.067631690
             PC36         PC37        PC38        PC39        PC40
[1,] -0.019328540  0.027045885 -0.01467986 -0.00732158  0.05082225
[2,]  0.042550057  0.004536028  0.01595791  0.02267139  0.33237431
[3,]  0.017625830  0.003143908 -0.02153624 -0.01162287 -0.19231512
[4,]  0.015260530 -0.009337707 -0.03368954  0.01920661  0.17195042
[5,] -0.038077959 -0.003937376  0.02531928 -0.02512450 -0.11504928
[6,]  0.008485763 -0.044439490 -0.01276119 -0.01742770  0.14202453
# Total loading of each gene, summed across all principal components
gene.load <- apply(pr.out$rotation, 1, sum)

# Rank genes by the absolute value of their total loading
gene.differ <- order(abs(gene.load), decreasing = TRUE)

# The 20 genes that differ most across the two groups by this measure
gene.differ[1:20]
 [1]   2 755 889 676 960 475 907 174 878 840 374 281 673 716 914 370 567 394 984
[20]  28
# Samples are the columns: the first 20 columns are healthy, the next 20 diseased
healthy  <- gene_express[, 1:20]
diseased <- gene_express[, 21:40]

# Apply a two-sample t-test to each gene (row), comparing healthy vs. diseased
# samples, and store the p-values
p_values <- apply(gene_express, 1, function(gene) t.test(gene[1:20], gene[21:40])$p.value)

# Adjust for multiple testing (e.g., Bonferroni correction)
p_adjusted <- p.adjust(p_values, method = "bonferroni")

# Identify genes with significant differences between the two groups
significant_genes <- which(p_adjusted < 0.05)

# Print the significant genes
print(significant_genes)

The PCA approach identifies genes that are important for the overall variance in the dataset but does not directly contrast the healthy and diseased samples. The t-test approach answers the collaborator's question directly: for each gene we compare its expression in the first 20 (healthy) samples against the last 20 (diseased) samples, and the genes whose Bonferroni-adjusted p-values fall below 0.05 are those whose expression differs most clearly between the two groups.
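A closely related variant is to rank all 1,000 genes by the absolute value of the two-sample t-statistic rather than thresholding adjusted p-values, which directly answers "which genes differ the most" even when few genes survive a strict Bonferroni cutoff (a sketch; it assumes `gene_express` has genes in rows and samples in columns, with the first 20 columns healthy):

```r
# Two-sample t-statistic for each gene (row), healthy vs. diseased samples
t_stats <- apply(gene_express, 1, function(gene) {
  t.test(gene[1:20], gene[21:40])$statistic
})

# Indices of the 10 genes with the largest absolute t-statistics
head(order(abs(t_stats), decreasing = TRUE), 10)
```

Ranking by |t| gives the collaborator a prioritized list for follow-up instead of a bare significant/not-significant split.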